Table of Contents

Add new variables with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

mutate() always adds new columns at the end of your dataset, but doesn’t change the original dataframe. If you want to keep the outcome of mutate(), or any of the other functions we discussed in the previous notebook, assign the output to a new or existing object.

# Starting with a smaller number of columns
flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
                     )

flights_sml
mutate(flights_sml,
       gain = dep_delay - arr_delay,
       speed = distance / air_time * 60
      )

Alternatively we could’ve used our piping skills:

flights %>%
  select(year:day, 
         ends_with("delay"), 
         distance, 
         air_time
        ) %>%
  mutate(gain = dep_delay - arr_delay,
         speed = distance / air_time * 60)

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
       gain = dep_delay - arr_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
      )

Grouped summaries with summarize()

The last key verb is summarize() (or summarise()). It collapses a data frame to a single row:

# Average delay
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

What happens if we don’t specify na.rm = TRUE?

group_by()

summarize() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll automatically be applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

# Average delay for each day
flights %>%
  group_by(year, month, day) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

Examples

How does the average distance to destination impact the arrival delay on average?

delay <- flights %>% 
  group_by(dest) %>%  # group by destination
  summarize(count = n(),  # counting the number of flights (per destination)
            Distance = mean(distance, na.rm = TRUE),  # average distance (per destination)
            Delay = mean(arr_delay, na.rm = TRUE)  # average delay (per destination)
           ) %>%
  filter(count > 20, dest != "HNL")  # keeping destinations with more than 20 flights that are not "HNL"
`summarise()` ungrouping output (override with `.groups` argument)
delay
p <- ggplot(data = delay, mapping = aes(x = Distance, y = Delay)) +
  geom_point(aes(size = count, color = dest), alpha = 1/3) +
  geom_smooth(se = FALSE) +
  geom_text(aes(label=dest),size=2.75) +
  labs(title = "Average distance to destination vs. arrival delay",
      caption = "The circle size shows the number of flights to that destination.")

ggplotly(p + theme(legend.position = "none"))  # Same plot with ggplotly()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there’s more ability to make up delays in the air?

Can you figure out why we removed “HNL” (Pacific/Honolulu)?


Let’s examine if the number of flights during the summer has increased:

flight_count <- flights %>%
  mutate(date = make_datetime(year, month, day)) %>%
  group_by(date) %>%
  summarize(count = n(),  # counting the number of flights (per day)
            delay = mean(arr_delay, na.rm = TRUE)  # average delay (per day)
           )
`summarise()` ungrouping output (override with `.groups` argument)
p <- ggplot(flight_count, aes(date, count)) +
  geom_line() +
  labs(title = "Number of flights out of NYC") +
  theme_classic()

ggplotly(p)

Interestingly, all of these dips are weekends. Let’s check this by creating a bar chart:

flights %>%
  mutate(date = make_datetime(year, month, day),
        weekday = wday(date, label = TRUE)) %>%
  group_by(weekday) %>%
  summarize(count = n(),  # counting the number of flights (per weekday)
            delay = mean(arr_delay, na.rm = TRUE)  # average delay (per weekday)
           ) %>% 
  ggplot(aes(weekday, count)) + 
    geom_bar(stat="identity", width = 0.5) + 
    labs(title="Number of flights per dat of week",
        x = "Day of Week",
        y = "Count") +
    theme_classic() +
    theme(axis.text.x = element_text(angle=65, vjust=0.6))
`summarise()` ungrouping output (override with `.groups` argument)

LS0tDQp0aXRsZTogIkRhdGEgVHJhbnNmb3JtYXRpb24iDQpzdWJ0aXRsZTogIlBhcnQgMiINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCiMjIyBUYWJsZSBvZiBDb250ZW50cw0KDQoqIEFkZCBuZXcgdmFyaWFibGVzIHdpdGggYG11dGF0ZSgpYA0KKiBHcm91cGVkIHN1bW1hcmllcyB3aXRoIGBzdW1tYXJpemUoKWANCiogR3JvdXAgYnkgd2l0aCBgZ3JvdXBfYnkoKWANCg0KYGBge3IgZWNobz1GQUxTRX0NCiMgbG9hZGluZyBsaWJyYXJpZXMNCmxpYnJhcnkodGlkeXZlcnNlKQ0KbGlicmFyeShsdWJyaWRhdGUpDQpsaWJyYXJ5KG55Y2ZsaWdodHMxMykNCmxpYnJhcnkocGxvdGx5KQ0KDQojIG1vZGlmeWluZyBjaGFydCBzaXplDQpvcHRpb25zKHJlcHIucGxvdC53aWR0aD01LCByZXByLnBsb3QuaGVpZ2h0PTMpDQpgYGANCg0KIyMgQWRkIG5ldyB2YXJpYWJsZXMgd2l0aCBgbXV0YXRlKClgDQpCZXNpZGVzIHNlbGVjdGluZyBzZXRzIG9mIGV4aXN0aW5nIGNvbHVtbnMsIGl0J3Mgb2Z0ZW4gdXNlZnVsIHRvIGFkZCBuZXcgY29sdW1ucyB0aGF0IGFyZSBmdW5jdGlvbnMgb2YgZXhpc3RpbmcgY29sdW1ucy4gVGhhdCdzIHRoZSBqb2Igb2YgYG11dGF0ZSgpYC4NCg0KYG11dGF0ZSgpYCBhbHdheXMgYWRkcyBuZXcgY29sdW1ucyBhdCB0aGUgZW5kIG9mIHlvdXIgZGF0YXNldCwgYnV0IGRvZXNuJ3QgY2hhbmdlIHRoZSBvcmlnaW5hbCBkYXRhZnJhbWUuIElmIHlvdSB3YW50IHRvIGtlZXAgdGhlIG91dGNvbWUgb2YgYG11dGF0ZSgpYCwgb3IgYW55IG9mIHRoZSBvdGhlciBmdW5jdGlvbnMgd2UgZGlzY3Vzc2VkIGluIHRoZSBwcmV2aW91cyBub3RlYm9vaywgYXNzaWduIHRoZSBvdXRwdXQgdG8gYSBuZXcgb3IgZXhpc3Rpbmcgb2JqZWN0Lg0KDQpgYGB7cn0NCiMgU3RhcnRpbmcgd2l0aCBhIHNtYWxsZXIgbnVtYmVyIG9mIGNvbHVtbnMNCmZsaWdodHNfc21sIDwtIHNlbGVjdChmbGlnaHRzLCANCiAgICAgICAgICAgICAgICAgICAgICB5ZWFyOmRheSwgDQogICAgICAgICAgICAgICAgICAgICAgZW5kc193aXRoKCJkZWxheSIpLCANCiAgICAgICAgICAgICAgICAgICAgICBkaXN0YW5jZSwgDQogICAgICAgICAgICAgICAgICAgICAgYWlyX3RpbWUNCiAgICAgICAgICAgICAgICAgICAgICkNCg0KZmxpZ2h0c19zbWwNCmBgYA0KDQpgYGB7cn0NCm11dGF0ZShmbGlnaHRzX3NtbCwNCiAgICAgICBnYWluID0gZGVwX2RlbGF5IC0gYXJyX2RlbGF5LA0KICAgICAgIHNwZWVkID0gZGlzdGFuY2UgLyBhaXJfdGltZSAqIDYwDQogICAgICApDQpgYGANCg0KQWx0ZXJuYXRpdmVseSB3ZSBjb3VsZCd2ZSB1c2VkIG91ciBwaXBpbmcgc2tpbGxzOg0KDQpgYGB7cn0NCmZsaWdodHMgJT4lDQogIHNlbGVjdCh5ZWFyOmRheSwgDQogICAgICAgICBlbmRzX3dpdGgoImRlbGF5IiksIA0KICAgICAgICAgZGlzdGFuY2UsIA0KICAgICAgICAgYWlyX3RpbWUNCiAgICAgICAgKSAlPiUNCiAgbXV0YXRlKGdhaW4gPSBkZXBfZGVsYXkgLSBhcnJfZGVsYXksDQogICAgICAgICBzcGVlZCA9IGRpc3RhbmNlIC8gYWlyX3RpbWUgKiA2MCkNCmBgYA0KDQpOb3RlIHRoYXQgeW91IGNhbiByZWZlciB0byBjb2x1bW5zIHRoYXQgeW91J3ZlIGp1c3QgY3JlYXRlZDoNCg0KYGBge3J9DQptdXRhdGUoZmxpZ2h0c19zbWwsDQogICAgICAgZ2FpbiA9IGRlcF9kZWxheSAtIGFycl9kZWxheSwNCiAgICAgICBob3VycyA9IGFpcl90aW1lIC8gNjAsDQogICAgICAgZ2Fpbl9wZXJfaG91ciA9IGdhaW4gLyBob3Vycw0KICAgICAgKQ0KYGBgDQoNCi0tLQ0KDQojIyBHcm91cGVkIHN1bW1hcmllcyB3aXRoIGBzdW1tYXJpemUoKWANClRoZSBsYXN0IGtleSB2ZXJiIGlzIGBzdW1tYXJpemUoKWAgKG9yIGBzdW1tYXJpc2UoKWApLiBJdCBjb2xsYXBzZXMgYSBkYXRhIGZyYW1lIHRvIGEgc2luZ2xlIHJvdzoNCg0KYGBge3J9DQojIEF2ZXJhZ2UgZGVsYXkNCnN1bW1hcml6ZShmbGlnaHRzLCBkZWxheSA9IG1lYW4oZGVwX2RlbGF5LCBuYS5ybSA9IFRSVUUpKQ0KYGBgDQoNCldoYXQgaGFwcGVucyBpZiB3ZSBkb24ndCBzcGVjaWZ5IGBuYS5ybSA9IFRSVUVgPw0KDQojIyMgYGdyb3VwX2J5KClgDQoNCmBzdW1tYXJpemUoKWAgaXMgbm90IHRlcnJpYmx5IHVzZWZ1bCB1bmxlc3Mgd2UgcGFpciBpdCB3aXRoIGBncm91cF9ieSgpYC4gVGhpcyBjaGFuZ2VzIHRoZSB1bml0IG9mIGFuYWx5c2lzIGZyb20gdGhlIGNvbXBsZXRlIGRhdGFzZXQgdG8gaW5kaXZpZHVhbCBncm91cHMuIFRoZW4sIHdoZW4geW91IHVzZSB0aGUgZHBseXIgdmVyYnMgb24gYSBncm91cGVkIGRhdGEgZnJhbWUgdGhleSdsbCBhdXRvbWF0aWNhbGx5IGJlIGFwcGxpZWQgImJ5IGdyb3VwIi4gRm9yIGV4YW1wbGUsIGlmIHdlIGFwcGxpZWQgZXhhY3RseSB0aGUgc2FtZSBjb2RlIHRvIGEgZGF0YSBmcmFtZSBncm91cGVkIGJ5IGRhdGUsIHdlIGdldCB0aGUgYXZlcmFnZSBkZWxheSBwZXIgZGF0ZToNCg0KYGBge3J9DQojIEF2ZXJhZ2UgZGVsYXkgZm9yIGVhY2ggZGF5DQpmbGlnaHRzICU+JQ0KICBncm91cF9ieSh5ZWFyLCBtb250aCwgZGF5KSAlPiUNCiAgc3VtbWFyaXplKGRlbGF5ID0gbWVhbihkZXBfZGVsYXksIG5hLnJtID0gVFJVRSkpDQpgYGANCg0KIyMjIEV4YW1wbGVzDQoNCkhvdyBkb2VzIHRoZSBhdmVyYWdlIGRpc3RhbmNlIHRvIGRlc3RpbmF0aW9uIGltcGFjdCB0aGUgYXJyaXZhbCBkZWxheSBvbiBhdmVyYWdlPw0KDQpgYGB7cn0NCmRlbGF5IDwtIGZsaWdodHMgJT4lIA0KICBncm91cF9ieShkZXN0KSAlPiUgICMgZ3JvdXAgYnkgZGVzdGluYXRpb24NCiAgc3VtbWFyaXplKGNvdW50ID0gbigpLCAgIyBjb3VudGluZyB0aGUgbnVtYmVyIG9mIGZsaWdodHMgKHBlciBkZXN0aW5hdGlvbikNCiAgICAgICAgICAgIERpc3RhbmNlID0gbWVhbihkaXN0YW5jZSwgbmEucm0gPSBUUlVFKSwgICMgYXZlcmFnZSBkaXN0YW5jZSAocGVyIGRlc3RpbmF0aW9uKQ0KICAgICAgICAgICAgRGVsYXkgPSBtZWFuKGFycl9kZWxheSwgbmEucm0gPSBUUlVFKSAgIyBhdmVyYWdlIGRlbGF5IChwZXIgZGVzdGluYXRpb24pDQogICAgICAgICAgICkgJT4lDQogIGZpbHRlcihjb3VudCA+IDIwLCBkZXN0ICE9ICJITkwiKSAgIyBrZWVwaW5nIGRlc3RpbmF0aW9ucyB3aXRoIG1vcmUgdGhhbiAyMCBmbGlnaHRzIHRoYXQgYXJlIG5vdCAiSE5MIg0KDQpkZWxheQ0KYGBgDQoNCmBgYHtyfQ0KcCA8LSBnZ3Bsb3QoZGF0YSA9IGRlbGF5LCBtYXBwaW5nID0gYWVzKHggPSBEaXN0YW5jZSwgeSA9IERlbGF5KSkgKw0KICBnZW9tX3BvaW50KGFlcyhzaXplID0gY291bnQsIGNvbG9yID0gZGVzdCksIGFscGhhID0gMS8zKSArDQogIGdlb21fc21vb3RoKHNlID0gRkFMU0UpICsNCiAgZ2VvbV90ZXh0KGFlcyhsYWJlbD1kZXN0KSxzaXplPTIuNzUpICsNCiAgbGFicyh0aXRsZSA9ICJBdmVyYWdlIGRpc3RhbmNlIHRvIGRlc3RpbmF0aW9uIHZzLiBhcnJpdmFsIGRlbGF5IiwNCiAgICAgIGNhcHRpb24gPSAiVGhlIGNpcmNsZSBzaXplIHNob3dzIHRoZSBudW1iZXIgb2YgZmxpZ2h0cyB0byB0aGF0IGRlc3RpbmF0aW9uLiIpDQoNCmdncGxvdGx5KHAgKyB0aGVtZShsZWdlbmQucG9zaXRpb24gPSAibm9uZSIpKSAgIyBTYW1lIHBsb3Qgd2l0aCBnZ3Bsb3RseSgpDQpgYGANCg0KSXQgbG9va3MgbGlrZSBkZWxheXMgaW5jcmVhc2Ugd2l0aCBkaXN0YW5jZSB1cCB0byB+NzUwIG1pbGVzIGFuZCB0aGVuIGRlY3JlYXNlLiBNYXliZSBhcyBmbGlnaHRzIGdldCBsb25nZXIgdGhlcmUncyBtb3JlIGFiaWxpdHkgdG8gbWFrZSB1cCBkZWxheXMgaW4gdGhlIGFpcj8NCg0KQ2FuIHlvdSBmaWd1cmUgb3V0IHdoeSB3ZSByZW1vdmVkICJITkwiIChQYWNpZmljL0hvbm9sdWx1KT8NCg0KLS0tDQoNCkxldCdzIGV4YW1pbmUgaWYgdGhlIG51bWJlciBvZiBmbGlnaHRzIGR1cmluZyB0aGUgc3VtbWVyIGhhcyBpbmNyZWFzZWQ6DQoNCmBgYHtyfQ0KZmxpZ2h0X2NvdW50IDwtIGZsaWdodHMgJT4lDQogIG11dGF0ZShkYXRlID0gbWFrZV9kYXRldGltZSh5ZWFyLCBtb250aCwgZGF5KSkgJT4lDQogIGdyb3VwX2J5KGRhdGUpICU+JQ0KICBzdW1tYXJpemUoY291bnQgPSBuKCksICAjIGNvdW50aW5nIHRoZSBudW1iZXIgb2YgZmxpZ2h0cyAocGVyIGRheSkNCiAgICAgICAgICAgIGRlbGF5ID0gbWVhbihhcnJfZGVsYXksIG5hLnJtID0gVFJVRSkgICMgYXZlcmFnZSBkZWxheSAocGVyIGRheSkNCiAgICAgICAgICAgKQ0KDQpwIDwtIGdncGxvdChmbGlnaHRfY291bnQsIGFlcyhkYXRlLCBjb3VudCkpICsNCiAgZ2VvbV9saW5lKCkgKw0KICBsYWJzKHRpdGxlID0gIk51bWJlciBvZiBmbGlnaHRzIG91dCBvZiBOWUMiKSArDQogIHRoZW1lX2NsYXNzaWMoKQ0KDQpnZ3Bsb3RseShwKQ0KYGBgDQoNCkludGVyZXN0aW5nbHksIGFsbCBvZiB0aGVzZSBkaXBzIGFyZSB3ZWVrZW5kcy4gTGV0J3MgY2hlY2sgdGhpcyBieSBjcmVhdGluZyBhIGJhciBjaGFydDoNCg0KYGBge3J9DQpmbGlnaHRzICU+JQ0KICBtdXRhdGUoZGF0ZSA9IG1ha2VfZGF0ZXRpbWUoeWVhciwgbW9udGgsIGRheSksDQogICAgICAgIHdlZWtkYXkgPSB3ZGF5KGRhdGUsIGxhYmVsID0gVFJVRSkpICU+JQ0KICBncm91cF9ieSh3ZWVrZGF5KSAlPiUNCiAgc3VtbWFyaXplKGNvdW50ID0gbigpLCAgIyBjb3VudGluZyB0aGUgbnVtYmVyIG9mIGZsaWdodHMgKHBlciB3ZWVrZGF5KQ0KICAgICAgICAgICAgZGVsYXkgPSBtZWFuKGFycl9kZWxheSwgbmEucm0gPSBUUlVFKSAgIyBhdmVyYWdlIGRlbGF5IChwZXIgd2Vla2RheSkNCiAgICAgICAgICAgKSAlPiUgDQogIGdncGxvdChhZXMod2Vla2RheSwgY291bnQpKSArIA0KICAgIGdlb21fYmFyKHN0YXQ9ImlkZW50aXR5Iiwgd2lkdGggPSAwLjUpICsgDQogICAgbGFicyh0aXRsZT0iTnVtYmVyIG9mIGZsaWdodHMgcGVyIGRhdCBvZiB3ZWVrIiwNCiAgICAgICAgeCA9ICJEYXkgb2YgV2VlayIsDQogICAgICAgIHkgPSAiQ291bnQiKSArDQogICAgdGhlbWVfY2xhc3NpYygpICsNCiAgICB0aGVtZShheGlzLnRleHQueCA9IGVsZW1lbnRfdGV4dChhbmdsZT02NSwgdmp1c3Q9MC42KSkNCmBgYA0K